Introduction

Sports are ingrained in American (and global) culture as a primary pastime, whether played or watched. Sports statistics give passionate fans another avenue for enjoying their favorite teams. This project walks through several visualizations of one particularly rich data set as examples of what can be done.

The Data

The data for this project can be found here: Google BigQuery. BigQuery is a data warehouse hosted by Google that contains databases on a wide variety of topics, including one on NCAA basketball. The NCAA database contains information about college teams, players, and games, and even play-by-play information for more recent seasons. This is an unusually rich data set to be available for free, something that is becoming increasingly uncommon in sports statistics.

Data for this project cover five seasons of Hawkeyes men’s basketball, from 2012-13 through 2016-17, as these are the most recent seasons with detailed play-by-play data. Downloading data from Google’s service was fairly straightforward. Data were pulled from two tables: one focused on game-level information, and the other on play-by-play data. Results were filtered to games and plays involving the Hawkeyes and imported into R, where a few additional cleaning steps were performed (removing missing values, calculating shot distances from coordinate data, etc.).
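As a rough illustration of the shot-distance cleaning step, here is a minimal sketch in base R. The column names (`x`, `y`) and the coordinate system (feet, origin at center court, baskets at x = ±41.75) are assumptions chosen to match the court graphic later in this post, not the actual BigQuery schema. The sketch also assumes every shot is aimed at the nearer basket.

```r
# Toy shot records for illustration only. Column names and units are
# assumptions: feet, origin at center court, baskets at x = +/-41.75, y = 0.
shots <- data.frame(
  x = c(41.75, 28.00, 21.00, -41.75, NA),
  y = c( 0.00,  0.00,  0.00,   3.00, 5.00)
)

# Drop rows with missing coordinates.
shots <- shots[complete.cases(shots[, c("x", "y")]), ]

# Fold both halves of the court onto one with abs(), so the target basket
# is always at (41.75, 0), then take the Euclidean distance to it.
basket_x <- 41.75
shots$distance <- sqrt((abs(shots$x) - basket_x)^2 + shots$y^2)

print(shots)
```

Folding with `abs()` keeps the calculation to one line, at the cost of misjudging the occasional heave from beyond half court.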

Home vs Away

A good starting point for exploring the Hawkeyes is to compare their performance at Carver-Hawkeye Arena with their performance on the road.

The left plot above illustrates final scores in Hawkeye games. The dotted line represents a tie score. Points to the top left of this line indicate a Hawkeye win, and bottom right points represent losses. It’s evident that Iowa wins a much higher proportion of home games than away games. This is made even clearer with the adjacent chart.
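The win/loss reading of the scatter plot can also be computed directly. The sketch below uses made-up scores and hypothetical column names (`venue`, `iowa_pts`, `opp_pts`); it only shows the shape of the calculation, not the project's actual results.

```r
# Made-up final scores for illustration only; not results from the project.
games <- data.frame(
  venue    = c("home", "home", "home", "away", "away", "away"),
  iowa_pts = c(78, 81, 65, 70, 62, 74),
  opp_pts  = c(70, 75, 68, 72, 66, 71)
)

# A game is a win when Iowa outscored its opponent, which is the same
# test as asking on which side of the tie line a point falls.
games$win <- games$iowa_pts > games$opp_pts

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the win rate per venue.
win_rate <- tapply(games$win, games$venue, mean)
print(win_rate)
```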

The chart above shows that a major contributing factor to the Hawkeyes’ better home performance is their shooting, which improves across the board.

Conference and Opponent Comparisons

How do the Hawkeyes do in Big Ten play compared to games against teams from outside their conference? The following two plots show that they’re about middle-of-the-road in both shooting efficiency and scoring.

Breaking down these same results by conference opponent, we can see that the Hawkeyes were fairly consistent in terms of field goal percentage, and more variable when it came to scoring. Interestingly, despite shooting most effectively against Michigan, the Hawkeyes scored the third-fewest points on average against them. This suggests that factors beyond shooting (like rebounding and turnovers) can also have a significant impact on scoring, which is to be expected.
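A per-opponent summary like this can be built with `aggregate()`. The following is a sketch on invented numbers, with hypothetical column names (`opponent`, `fgm`, `fga`, `pts`). It also shows how a higher field goal percentage can coexist with a lower scoring average when a team gets fewer attempts.

```r
# Made-up box-score rows for illustration only; column names are hypothetical.
games <- data.frame(
  opponent = c("A", "A", "B", "B"),
  fgm = c(28, 26, 24, 25),   # field goals made
  fga = c(55, 53, 60, 58),   # field goals attempted
  pts = c(66, 64, 75, 73)    # points scored by the team of interest
)

# Sum makes and attempts per opponent before dividing, so games with more
# attempts carry more weight than in a plain mean of per-game percentages.
per_opp <- aggregate(cbind(fgm, fga) ~ opponent, data = games, FUN = sum)
per_opp$fg_pct <- per_opp$fgm / per_opp$fga

# Average points per game against each opponent.
avg_pts <- aggregate(pts ~ opponent, data = games, FUN = mean)
per_opp <- merge(per_opp, avg_pts, by = "opponent")

print(per_opp)
```

In this toy data the team shoots better against opponent A but scores more against opponent B, simply because it takes more shots there.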

Shot Locations

Three-point shooting has become increasingly common over the past decade. Do we see this trend begin to emerge over the five-season span in this data set?

The ridge plots above show the distribution of shot distances in each season. Each distribution has three modes: near the basket, at the free throw line, and just beyond the three-point arc. The distributions look very similar from season to season, so it doesn’t appear that shooting habits changed much over the timeframe.
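One way to put a number on this comparison is the share of shots taken from at least the three-point distance in each season. The sketch below runs on simulated distances, so its output says nothing about the real Hawkeyes data. The 20.75 ft threshold matches the arc radius used in the court graphic later in this post. The plotting part assumes the `ggridges` package, which is one common way to draw ridge plots and may not be what was used for the figure above.

```r
# Simulated shot distances in feet for two seasons; illustration only.
set.seed(1)
shots <- data.frame(
  season   = rep(c("2012-13", "2016-17"), each = 200),
  distance = c(runif(200, 0, 25), runif(200, 0, 25))
)

# Flag shots at or beyond the arc and compute the share per season.
three_pt_line <- 20.75
shots$is_three <- shots$distance >= three_pt_line
three_share <- tapply(shots$is_three, shots$season, mean)
print(round(three_share, 3))

# Optional ridge plot, drawn only if both packages are installed.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggridges", quietly = TRUE)) {
  library(ggplot2)
  library(ggridges)
  ridge_plot <- ggplot(shots, aes(x = distance, y = season)) +
    geom_density_ridges()
  print(ridge_plot)
}
```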

With the coordinate data available to us, we can also plot specific shot locations on this court graphic:

### Create court backdrop
library(ggplot2)  # provides ggplot(), geom_path(), coord_fixed()

court <- ggplot(data=data.frame(y=1,x=1),aes(x,y))+
    ###outside box:
      geom_path(data=data.frame(y=c(-25,-25,25,25,-25),x=c(-47,47,47,-47,-47)))+
    ###halfcourt line:
      geom_path(data=data.frame(y=c(-25,25),x=c(0,0)))+
    ###halfcourt semicircle:
      geom_path(data=data.frame(y=c(-6000:(-1)/1000,1:6000/1000),x=c(sqrt(6^2-c(-6000:(-1)/1000,1:6000/1000)^2))),aes(y=y,x=x))+
      geom_path(data=data.frame(y=c(-6000:(-1)/1000,1:6000/1000),x=-c(sqrt(6^2-c(-6000:(-1)/1000,1:6000/1000)^2))),aes(y=y,x=x))+
    ###solid FT semicircle above FT line:
      geom_path(data=data.frame(y=c(-6000:(-1)/1000,1:6000/1000),x=c(28-sqrt(6^2-c(-6000:(-1)/1000,1:6000/1000)^2))),aes(y=y,x=x))+
      geom_path(data=data.frame(y=c(-6000:(-1)/1000,1:6000/1000),x=-c(28-sqrt(6^2-c(-6000:(-1)/1000,1:6000/1000)^2))),aes(y=y,x=x))+
    ###dashed FT semicircle below FT line:
      geom_path(data=data.frame(y=c(-6000:(-1)/1000,1:6000/1000),x=c(28+sqrt(6^2-c(-6000:(-1)/1000,1:6000/1000)^2))),aes(y=y,x=x),linetype='dashed')+
      geom_path(data=data.frame(y=c(-6000:(-1)/1000,1:6000/1000),x=-c(28+sqrt(6^2-c(-6000:(-1)/1000,1:6000/1000)^2))),aes(y=y,x=x),linetype='dashed')+
    ###key:
      geom_path(data=data.frame(y=c(-8,-8,8,8,-8),x=c(47,28,28,47,47)))+
      geom_path(data=data.frame(y=-c(-8,-8,8,8,-8),x=-c(47,28,28,47,47)))+
    ###box inside the key:
      geom_path(data=data.frame(y=c(-6,-6,6,6,-6),x=c(47,28,28,47,47)))+
      geom_path(data=data.frame(y=c(-6,-6,6,6,-6),x=-c(47,28,28,47,47)))+
    ###restricted area semicircle:
      geom_path(data=data.frame(y=c(-4000:(-1)/1000,1:4000/1000),x=c(41.25-sqrt(4^2-c(-4000:(-1)/1000,1:4000/1000)^2))),aes(y=y,x=x))+
      geom_path(data=data.frame(y=c(-4000:(-1)/1000,1:4000/1000),x=-c(41.25-sqrt(4^2-c(-4000:(-1)/1000,1:4000/1000)^2))),aes(y=y,x=x))+
    ###rim:
      geom_path(data=data.frame(y=c(-750:(-1)/1000,1:750/1000,750:1/1000,-1:-750/1000),x=c(c(41.75+sqrt(0.75^2-c(-750:(-1)/1000,1:750/1000)^2)),c(41.75-sqrt(0.75^2-c(750:1/1000,-1:-750/1000)^2)))),aes(y=y,x=x))+
      geom_path(data=data.frame(y=c(-750:(-1)/1000,1:750/1000,750:1/1000,-1:-750/1000),x=-c(c(41.75+sqrt(0.75^2-c(-750:(-1)/1000,1:750/1000)^2)),c(41.75-sqrt(0.75^2-c(750:1/1000,-1:-750/1000)^2)))),aes(y=y,x=x))+
    ###backboard:
      geom_path(data=data.frame(y=c(-3,3),x=c(43,43)),lineend='butt')+
      geom_path(data=data.frame(y=c(-3,3),x=-c(43,43)),lineend='butt')+
    ###three-point line:
      geom_path(data=data.frame(y=c(-20.75,-20750:(-1)/1000,1:20750/1000,20.75),x=c(47,41.75-sqrt(20.75^2-c(-20750:(-1)/1000,1:20750/1000)^2),47)),aes(y=y,x=x))+
      geom_path(data=data.frame(y=c(-20.75,-20750:(-1)/1000,1:20750/1000,20.75),x=-c(47,41.75-sqrt(20.75^2-c(-20750:(-1)/1000,1:20750/1000)^2),47)),aes(y=y,x=x))+
    ###fix aspect ratio to 1:1
      coord_fixed()

court

Using this graphic, we can create heat maps of shot locations. This first plot shows where the Hawkeyes’ shots were taken from.
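A heat map of this kind boils down to counting shots per grid cell. The sketch below does that on simulated coordinates. The 2 ft cell size and the column names are arbitrary assumptions, and the original plot may well have used a different binning or a smoothed density instead.

```r
# Simulated half-court shot coordinates in feet; illustration only.
set.seed(42)
shots <- data.frame(
  x = runif(300, 0, 47),
  y = runif(300, -25, 25)
)

# Assign each shot to a 2 ft x 2 ft cell (labelled by the cell center)
# and count the shots that fall in each cell.
cell <- 2
shots$x_bin <- floor(shots$x / cell) * cell + cell / 2
shots$y_bin <- floor(shots$y / cell) * cell + cell / 2
counts <- aggregate(n ~ x_bin + y_bin,
                    data = transform(shots, n = 1), FUN = sum)

# Draw the counts as tiles, only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  heat <- ggplot(counts, aes(x = x_bin, y = y_bin, fill = n)) +
    geom_tile() +
    coord_fixed()
  print(heat)
  # To draw over the court backdrop instead, something like
  #   court + geom_tile(data = counts,
  #                     aes(x = x_bin, y = y_bin, fill = n), alpha = 0.7)
  # should work, assuming `court` has been built as shown earlier.
}
```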

We can also create a plot for the proportion of shots made, and take it a step further by making it interactive.
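The proportion-made version uses the same binning idea, dividing makes by attempts in each cell. This is a sketch on a handful of invented shots with hypothetical column names. For interactivity, one common route is `plotly::ggplotly()` applied to a ggplot object, though the original may have used a different tool.

```r
# Invented shots for illustration only: location in feet plus make/miss.
shots <- data.frame(
  x    = c(40.5, 41.0, 40.2, 27.5, 27.9, 21.3),
  y    = c( 0.5,  1.0,  0.3,  0.4,  1.2,  0.1),
  made = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Same 2 ft x 2 ft grid as a count-based heat map would use.
cell <- 2
shots$x_bin <- floor(shots$x / cell) * cell + cell / 2
shots$y_bin <- floor(shots$y / cell) * cell + cell / 2

# Sum makes and attempts per cell, then divide to get the proportion made.
fg <- aggregate(cbind(made, attempts) ~ x_bin + y_bin,
                data = transform(shots, attempts = 1), FUN = sum)
fg$fg_pct <- fg$made / fg$attempts

print(fg)
```

Keeping the `attempts` column alongside the percentage makes it easy to hide or fade cells with very few shots, where the proportion is mostly noise.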

Conclusion

One of the things I find most wonderful about sports is that there is no one right way to enjoy them, or to find and consume information about them. Hopefully this brief overview of Hawkeyes basketball has encouraged or inspired you to ask and answer questions of your own.

The queries and code for this project can be found in my GitHub repository.